Understanding Linear Regression
I find the best way to understand linear regressions is as recipes for drawing a graph. Let’s start off with an example relevant to my own life: the relationship between temperatures in Celcius and temperatures in Farenheit. There are two important numeric constants to convert back and forth between them:
- Freezing = 0°C = 32°F
- A change of 1°C = 1.8°F
This gives us all of the necessary components for drawing a graph of the relationship between Farenheit and Celcius. First, we know that a line plotting this relationship must cross 0 on the x-axis at 32 on the y-axis, because freezing = 0°C = 32°F.

Next, we know that for 1 every degree we move to the right, we must move up 1.8.

The line, representing this relationship, thus has the formula:
\[
y = 32 + (1.8\times x)
\]
And looks like

This line has the following properties:
- Its Intercept, or where the line crosses 0 on the x-axis, is 32.
- Its Slope, is 1.8 (that is, a change of 1 on the x-axis corresponds to a change in 1.8 on the y-axis)
And, in fact, all lines, including the ones we fit using linear regression, are described in terms of their intercept (where the line crosses 0 on the x-axis) and slope.
But, let’s say you didn’t know the formula to convert back and forth between Celcius and Farenheit. Maybe you weren’t even sure that there was such a formula. And all you had were a bunch of kind of crappy thermometers that gave you both Celcius and Farenheit readings. And maybe the thermometers are kind of hard to read, or the lighting conditions are dark, or you only have limited experience in reading thermometers. In that case, the relationship between Celcius and Farenheit might look a lot more like this:

In this case, you might want to fit a linear regression. A linear regression assumes that the relationship between two variables can be captured using some line, and it tries to estimate the Intercept and Slope of this assumed line.
The way it does this is by trying to find parameters for a line that minimize the “error” or the “residuals” between the observed data, and the line. For example, here are examples of bad lines. The first assumes that all temperatures Celcius are basically 50°F. The other assumes that all temperatures in Farenheit are basically 3\(\times\)C.
Here is a plot of the best-fitting line, that has the smallest average residuals.

There are actually many different ways to figure out what the best line is (remembering that by “line” we really mean the pairing of Intercept and Slope). For example, one really simplistic approach would be just to draw a bunch of different lines, and see which one worked best.

However, there have been some long established methods for solving this problem, at is, for finding the one and truly best Intercept and Slope for a given set of data points.
Idiom
This isn’t R idiom, but stats idiom. Here are some translations for how people describe solving this problem.
- analytic, closed form = “We solved it with complicated math.”
- iterative, stochastic = “We made a bunch of guesses.”
For the sake of finishing the illustration, here is the solution that a linear regression found for this sample data:
Call:
lm(formula = f_observe ~ c_observe, data = temp_df)
Coefficients:
(Intercept) c_observe
39.275 1.382
The model has estimated that the Intercept is \(\approx\) 39 and the slope is \(\approx\) 1.4 . That is to say, based on our “sample”, it thinks 0°C = 39°F, and that every 1°C increase corresponds to a 1.4°F increase.
We are lucky to know that this is, strictly speaking, wrong, but we won’t usually, or typically, be able to in such a situation. However, the estimate of both the intercept and slope have the right sign (that is, it hasn’t guessed the intercept is -10, or the slope goes the other way), and they’re just about the right magnitude (that is, it hasn’t guessed that the slope is something like 10, or 100).
Now, let’s consider how the linear model is describing the value of this highlighted dot.

First, you take it’s value in Celcius, and multiply it by 1.3820366 (which is 28.47)

Then,you add 39.2752753 to it (which is 67.74)

Then, you add the residual error to give us the observed data point.

The generic formula for finding the value for any data point is
\[
F_i = 39.3 + (1.4\times C_i) + \epsilon_i.
\] In this formula the \(i\) is convention for referring to some generic, \(i^{th}\) data point and \(\epsilon\) is conventional for referring to the residual errors.
Some additional conventions:
- Referring to the outcome variable as \(y\).
- Referring to the predictors as \(x\).
- Referring to the remaining parameters as \(\beta\).
So you may sometimes see people referring to \(\beta_0\), or the “betas” in a model. The fully generic formula for linear regression is sometimes given as
\[
y_i = \beta_0 + \beta x_i + \epsilon_i
\]
Idiom
- We call the Intercept and Slopes the parameters or coefficients of the model.
- For every data point, the value you get by only using the slopes and intercepts are called the fitted values, and they all lie along the lines you draw.
- Almost every data point’s actual value is different from its fitted value, and these differences are called the residuals.
Understanding categorical predictors
Categorical variables are necessarilly treated differently in regression models. This is because while you can multiply numeric predictors together, you can’t multiply 5 \(\times\) “experimental condition”.
To illustrate how categorical predictors work in a linear regression, we’ll use some example weather data. I pulled this data of the daily temperature from the University of Edinburgh Geosciences Department weather station.

In order to focus on a categorical effect, I’ve recoded the date variable into “seasons” like so:
| Jan |
Winter |
| Feb |
Winter |
| Mar |
Spring |
| Apr |
Spring |
| May |
Spring |
| Jun |
Summer |
| Jul |
Summer |
| Aug |
Summer |
| Sep |
Autumn |
| Oct |
Autumn |
| Nov |
Autumn |
| Dec |
Winter |
We’ll focus on just the subset of 2010 and 2011, because 2010 was a notably cold winter. Here’s all of the data for the four seasons from 2010 and 2011.

Winter is the “reference level”, for the season’s factor, which you can see if you use levels().
levels(temps_comp$season)
[1] "Winter" "Spring" "Summer" "Autumn"
When using treatment contrasts, the first level is always treated as the reference level.
If we fit a linear model to this data, we get the following solution:
Call:
lm(formula = temp ~ season, data = temps_comp)
Residuals:
Min 1Q Median 3Q Max
-11.5316 -1.7289 0.1078 1.8931 11.2572
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.7283 0.2312 11.80 <2e-16 ***
seasonSpring 5.2934 0.3338 15.86 <2e-16 ***
seasonSummer 10.7282 0.3265 32.86 <2e-16 ***
seasonAutumn 6.9975 0.3293 21.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.102 on 698 degrees of freedom
Multiple R-squared: 0.6154, Adjusted R-squared: 0.6137
F-statistic: 372.2 on 3 and 698 DF, p-value: < 2.2e-16
We’ve got a bunch of parameter estimates under “Estimate” and a bunch of significant p-values. But what does it mean to say “There is a significant effect of Spring”? And can we say anthing about, say, the difference between Spring and Autumn? Approach results that look like this exactly like you would a regression with continuous predictors: try to draw the graph it is a recipe for.
We start with the “Intercept”. With continuous predictors, the “intercept” is the y value where the line crosses 0 on the x-axis (a.k.a. the predicted value of the outcome variable when the predictor variable is 0). With categorical variables, there are no lines to draw to describe the relationship between outcome and predictor, so calling it an intercept is more of a metaphorical holdover of the continuous case.
Understand the “Intercept” for categorical predictors to be the predicted average of the reverence level. For the seasons model, the intercept is 2.73, meaning the estimated average winter temperature in Edinburgh is 2.73

Next, we move to the the Spring effect. Model estimates a Spring effect of 5.29. This is not the estimated temperature of Edinburgh in spring. This is how different the Spring temperature is from the Winter temperature. Meaning to get the estimated Spring temperature from the model we need to add the Spring effect to the intercept (2.73 + 5.29 = 8.02)

Next up is the Summer effect, which the model estimates is 10.73. Again, this is not the estimated temperature in the summer. This is how different Summer is estimated to be from the Intercept (Winter). To get the estimated Summer temperature from the model we need to add the Summer effect to the intercept (2.73 + 10.73 = 13.46).

And, what should seem familiar by now, the effect of Autumn is 7, meaning Autumn is 7 degrees warmer than Winter, meaning the estimated average temperature in Autumn is (2.73 + 7 = 9.73).

Let’s review the summary of the model again:
summary(season_model)
Call:
lm(formula = temp ~ season, data = temps_comp)
Residuals:
Min 1Q Median 3Q Max
-11.5316 -1.7289 0.1078 1.8931 11.2572
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.7283 0.2312 11.80 <2e-16 ***
seasonSpring 5.2934 0.3338 15.86 <2e-16 ***
seasonSummer 10.7282 0.3265 32.86 <2e-16 ***
seasonAutumn 6.9975 0.3293 21.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.102 on 698 degrees of freedom
Multiple R-squared: 0.6154, Adjusted R-squared: 0.6137
F-statistic: 372.2 on 3 and 698 DF, p-value: < 2.2e-16
We can say the following things about this model:
- The significant
(Intercept) effect means the average temperature in Winter is significantly different from 0 (estimated to be 2.7).
- The significant
seasonSpring effect means the difference between Spring and Winter is significantly different from 0 (Spring is estimated to be 5.3 degreees warmer than Winter.)
- The significant
seasonSummer effect means the difference between Summer and Winter is significantly different from 0 (Summer is estimated to be 10.7 degreees warmer than Winter.)
- The significant
seasonAutumn effect means the difference between Autumn and Winter is significantly different from 0 (Autumn is estimated to be 7 degreees warmer than Winter.)
And that is pretty much all we can say. Looking at these results, we can say that Spring is significantly warmer than Winter, but we can’t say anything about a generic “effect of Spring.” In fact, what would we mean by such a thing anyway?
We also can’t say anything about the difference between any of the other seasons. For example, Spring and Autumn have very similar looking temperatures (their difference from Winter is very similar). We can’t say anything about whether or not Spring is warmer or cooler than Autumn.
A more practical example - Scaling and transorming predictors
We’re now going to walk through a real example from my own research invovling the vowel /ay/. In this example, we’re going to be walking through
Brief Background on /ay/
The English vowel /ay/ is the vowel sound that appears in the English words ride and price. It is a “diphthong”, meaning it is a sound with two different articulatory targets. In the dialect this data is drawn from, the first target is something like a low and back [ɑ], which appears in the words hot and lot, and the second target is something like a high and front [i:], which shows up in the words fleece and heat. We’ll be focusing on just the first part of the diphthong, the typically low and back part.
Vowel sounds can be characterized using both articulatory and acoustic measures, and it has been found that certain acoustic features can be reliably correlated with the articulatory position of the tongue while the vowels are produced. This means we can characterize speakers’ pronunciations of vowel sounds from just a recording of their speech. If we examine how these acoustic properties of vowels covary with speakers’ ages or dates of birth, we can also estimate how pronunciations of these vowels have changed over time.
These acoustic properties are called “formants”, and the formant we are going focus on here is “F1”, which covaries with the height of the tongue in the mouth. The higher F1, the lower in the the mouth the tongue is. It’s also the case that F1 values systematically covary with the length of speakers’ vocal tracts, so we are actually going to be looking at normalized F1, which attempts to factor out this effect.
For this illustration, we’re going to first focus on the vowel /ay/ before voiceless consonants (i.e. /p, t, k, f, s/). In Philadelphia, where this data is from, /ay/ underwent a fairly large shift in its pronunciation across the 20th century.
ayT_means <- ay %>%
filter(plt_vclass == "ay0" )%>% # get the pre-voiceless tokens
mutate(dob = year - age) %>% # calculate speakers' dob
group_by(idstring, dob) %>% # prepare to estimate by-speaker means
summarise(F1_n = mean(F1_n)) # means
ayT_means %>%
ggplot(aes(dob, F1_n))+
geom_point()+
theme_bw()

In this plot, each point represents one speaker. Each speaker’s average normalized F1 value is plotted against their date of birth (ranging from 1889 to 1998). Fitting a linear model to this data appears easy, but actually poses us with interpretive challenges.
ayT_model1 <- lm(F1_n ~ dob, data = ayT_means)
summary(ayT_model1)
Call:
lm(formula = F1_n ~ dob, data = ayT_means)
Residuals:
Min 1Q Median 3Q Max
-0.93874 -0.12952 -0.00738 0.12320 0.82237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.8100970 1.0232146 25.23 <2e-16 ***
dob -0.0129710 0.0005262 -24.65 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2176 on 324 degrees of freedom
Multiple R-squared: 0.6522, Adjusted R-squared: 0.6512
F-statistic: 607.7 on 1 and 324 DF, p-value: < 2.2e-16
Let’s sort out what is being reported here in this model, starting with the intercept.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.8100970 1.0232146 25.23 <2e-16 ***
We see here that the intercept is significantly different from 0, but what does this intercept mean? The intercept is the estimated value of the outcome variable when the predictor is 0, meaning the model is predicting a value of normalized F1 of about 26 for speakers born in the year 0.

This is obviously problematic. First, this is a “significant” effect, but it extrapolates so far beyond the data that it’s entirely meaningless. (Nobody was speaking “English” in the year 0.) Second, this estimated normalized F1 value is not a plausible F1 value for any vowel. If you were to try to map this onto an articulatory strategy, speaker’s tongue bodies would be down somewhere around their ankles.
Let’s also look at the slope.
Estimate Std. Error t value Pr(>|t|)
dob -0.0129710 0.0005262 -24.65 <2e-16 ***
This means for every 1 unit increase in dob (i.e. 1 year), there is a decrease of -0.013 in normalized F1.

If we ignore the comically weird intercept, this slope is less problematic. The one major issue in it is its overspecificity. This slope is defined in terms of a change in 1 year of date of birth, but in an important sense that’s over precise. We wouldn’t really want to make any strong inferences about the difference between speakers’ born in 1969 and 1970, for example.
Rescaling
We can address these problems with the modekl by rescaling the date of birth predictor. Rescaling predictors is not only “allowed,” it can be crucial when it comes to trying to fit mixed effects models!
Idiom
Not to confuse things too much, but transormations you make to predictors can be lumped into two broad categories: linear and non-linear. This use of “linear” isn’t related to linear regression in any way. Instead, it describes the relationship between the original measure and the transformed measure.
- addition, subtraction, multiplication, division
Proof by simulation:
linear_df <- data_frame(original = rnorm(100, mean = 50),
addition = original + 10,
subtraction = original - 10,
multiplication = original * 10,
division = original/10)
Any combination of (only) linear transforms is a linear transform.
- logarithms (any base), cosine, square root, etc.
Proof by simulation:
nonlinear_df <- data_frame(original = rnorm(100, mean = 5),
log_ten = log10(original),
cosine = cos(original),
square_root = sqrt(original),
exponential = exp(original))
If we apply one, or many, linear transformations to the predictor variable, we won’t affect the goodness of fit of the model, nor the significance or sign of that predictor. But it will change the way we interpret the effects.
There are a few different kinds of transformations we can do. A very common kind of transformation is to z-score the data. To z-score numeric data, you subtract the mean and divide by the standard deviation. Here’s how that looks in mathematical notation
\[
z = \frac{x-\mu}{\sigma}
\]
Where \(\mu\) is the mean of the \(x\) data, and \(\sigma\) is the standard deviation. Here’s how that looks in practice. The date of birth of speakers has the following distribution of data, with the meand and \(\pm1\) standard deviation annotated.

By subtracting out the mean, we center this distribution on 0, and by dividing by the standard deviation, we shrink down the distribution so that 1 standard deviation from the mean is the value 1. Here’s how the result looks:

Why is this an improvement?
Let’s re-fit the model with this transformed DOB predictor and see:
ayT_model2 <- lm(F1_n ~ dob_z, data = ayT_means)
summary(ayT_model2)
Call:
lm(formula = F1_n ~ dob_z, data = ayT_means)
Residuals:
Min 1Q Median 3Q Max
-0.93874 -0.12952 -0.00738 0.12320 0.82237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.58862 0.01205 48.83 <2e-16 ***
dob_z -0.29760 0.01207 -24.65 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2176 on 324 degrees of freedom
Multiple R-squared: 0.6522, Adjusted R-squared: 0.6512
F-statistic: 607.7 on 1 and 324 DF, p-value: < 2.2e-16
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.58862 0.01205 48.83 <2e-16 ***
The intercept is the estimated value of normalized F1 when dob_z is 0, which is the mean of dob, 1944.

Estimate Std. Error t value Pr(>|t|)
dob_z -0.29760 0.01207 -24.65 <2e-16 ***
The slope tells us how much normalized F1 changes for every change of 1 standard deviation of dob (23 years)
For every 1 standard deviation shift in dob (23 years), there is a decrease of 0.3 in normalized F1.

These values of the intercept and slope are more interpretable.
Are we really allowed to do all this rescaling?
Yes, but, you must explain what you did in the write up. For example:
The date of birth predictor was centered on 1960, and divided by 10. The 1960 was chosen as the centering point because of its importance in previous sociolinguistic studies. Dividing dob by 10 gives the resulting slope the interpretation of the rate of change per decade.
Multivariate Models & Interactions
We’re quickly going to go over multivariate models (multiple predictors) and interactions. The important thing to keep in mind when including multiple predictor is that your reference level and scaling of each predictor is going to start to matter a lot more.
Continious and categorical predictors
Let’s start with the /ay/ data. Previously we were looking only at pre-voiceless /ay/. But now we’re going to look at all /ay/, with each token coded for whether it’s pre-voiceless, or elsewhere. This is encoded in the plt_vclass column.
plt_vclass
ay0 - Pre-voiceless /ay/
ay - All other /ay/.
Because we’re not dealing with mixed effects models yet, we’ll first calculate by-speaker means. In the process I’ll be re-scaling date of birth with two differing centering points, one on 1900, and one on 1960.
ay_means <- ay %>%
mutate(dob = year-age) %>%
group_by(idstring, dob, plt_vclass) %>%
summarise(F1_n = mean(F1_n))%>%
mutate(decade_1900 = (dob - 1900)/10,
decade_1960 = (dob - 1960)/10,
voicing = factor(recode(plt_vclass, ay0 = "voiceless",
ay = "else")),
voicing = relevel(voicing, "else"))
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
geom_point()+
theme_bw()

Two predictors, no interactions
We can see that there is a big difference between pre-voiceless and elsewhere /ay/. Let’s start building up the model then
ay_model1 <- lm(F1_n ~ decade_1900 + voicing, data = ay_means)
summary(ay_model1)
Call:
lm(formula = F1_n ~ decade_1900 + voicing, data = ay_means)
Residuals:
Min 1Q Median 3Q Max
-1.01218 -0.14438 -0.01086 0.13725 0.77522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.567577 0.020862 75.14 <2e-16 ***
decade_1900 -0.075493 0.003793 -19.91 <2e-16 ***
voicingvoiceless -0.643357 0.017376 -37.02 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2218 on 649 degrees of freedom
Multiple R-squared: 0.7314, Adjusted R-squared: 0.7305
F-statistic: 883.5 on 2 and 649 DF, p-value: < 2.2e-16
Let’s walk through the effects by drawing the graph that this model is a recipe for. First, we have the Intercept.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.567577 0.020862 75.14 <2e-16 ***
This is the estimated average value of normalized F1 for when all predictors are 0. That means this is the predicted intercept when the centered dob is 0 (1900), for the reference level (elsewhere /ay/)

Next there’s the slope:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.567577 0.020862 75.14 <2e-16 ***
There’s a few different ways we could construe this slope. The value is the rate of change per decade, and we’ll interpret this as the slope for reference level.

Finally, there’s the voicing parameter.
Estimate Std. Error t value Pr(>|t|)
voicingvoiceless -0.643357 0.017376 -37.02 <2e-16 ***
This tells us how different the intercept of the line for pre-voiceless /ay/ is from elsewhere /ay/. Because we have included no other effects or interactions, the model is assuming that the slope of the pre-voiceless /ay/ is identical to elsewhere /ay/.

Note, looking a this graph, you can see that this voiceless effect is not the estimated intercept of pre-voicelss /ay/. It is the estimated difference in intercepts between these two lines.
Evaluating the model.
Let’s look at the model summary again:
summary(ay_model1)
Call:
lm(formula = F1_n ~ decade_1900 + voicing, data = ay_means)
Residuals:
Min 1Q Median 3Q Max
-1.01218 -0.14438 -0.01086 0.13725 0.77522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.567577 0.020862 75.14 <2e-16 ***
decade_1900 -0.075493 0.003793 -19.91 <2e-16 ***
voicingvoiceless -0.643357 0.017376 -37.02 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2218 on 649 degrees of freedom
Multiple R-squared: 0.7314, Adjusted R-squared: 0.7305
F-statistic: 883.5 on 2 and 649 DF, p-value: < 2.2e-16
If you only fit this model, and didn’t plot any of the data, you might conclude:
- There is a significant change over decade for /ay/.
- There is a significant effect of voicing on /ay/.
Including interactions
However, looking at the plot, it’s clear something more complicated is going on. Neither of the lines look like they’re passing through their related data points very well. There is, in fact, an interaction between voicing and date of birth. Pre-voiceless /ay/ is undergoing a change, while elsewhere /ay/ is basically staying the same. We can account for this by including this interaction in the model:
ay_model2 <- lm(F1_n ~ decade_1900 * voicing, data = ay_means)
summary(ay_model2)
Call:
lm(formula = F1_n ~ decade_1900 * voicing, data = ay_means)
Residuals:
Min 1Q Median 3Q Max
-0.93874 -0.11050 -0.00824 0.10554 0.82237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.326564 0.022219 59.703 < 2e-16 ***
decade_1900 -0.021277 0.004443 -4.789 2.08e-06 ***
voicingvoiceless -0.161329 0.031423 -5.134 3.75e-07 ***
decade_1900:voicingvoiceless -0.108433 0.006283 -17.257 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1838 on 648 degrees of freedom
Multiple R-squared: 0.816, Adjusted R-squared: 0.8151
F-statistic: 957.6 on 3 and 648 DF, p-value: < 2.2e-16
Let’s build up the graph.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.326564 0.022219 59.703 < 2e-16 ***
This is the estimated average for when all preditors are 0 (i.e. dob = 1900, voicing = else).

Estimate Std. Error t value Pr(>|t|)
decade_1900 -0.021277 0.004443 -4.789 2.08e-06 ***
This is the estimated slope for the reference level (i.e. voicing = else)

Now, we get into the more complicated coefficients.
Estimate Std. Error t value Pr(>|t|)
voicingvoiceless -0.161329 0.031423 -5.134 3.75e-07 ***
This is how different the line for pre-voiceless /ay/ is from elsewhere /ay/ at the intercept (i.e. when dob = 1900).

Finally, we have the interaction effect.
Estimate Std. Error t value Pr(>|t|)
decade_1900:voicingvoiceless -0.108433 0.006283 -17.257 < 2e-16 ***
This is the most easilly misinterpretable parameter in this model. It is not the estimated slope of pre-voiceless /ay/. It is how different pre-voiceless /ay/’s slope is from elsewhere /ay/. That is, if you drew a line parallel to elsewhere /ay/, thos is how much you’d have to rotate the the line to get pre-voiceless /ay/’s slope.

Be careful thinking about models with interactions
This is super important to keep in mind, let’s say we ran this model, and got the following coefficients:
(Intercept) 1.32 ***
decade_1900 -10 ***
voicingvoiceless -0.16 ***
decade_1900:voicingvoiceless 10 ***
The significant interaction between date of birth and voicing here would not mean that there is significant change for pre-voiceless /ay/ over time. It means there is a significant difference between the slope of elsewhere /ay/ and pre-voiceless /ay/. The estimated slope of elsewhere /ay/ in this fake model is -10, and for pre-voiceless /ay/ it’s -10 + 10 = 0!
The other thing to keep in mind with interpreting the results of the model is that the voicingvoiceless effect is not an estimate of “the effect of voicing.” It is the estimated difference between the two lines at the intercept. If we change where the intercept is, for example by using a different scaling, we also change the value of this parameter.
summary(lm(F1_n ~ decade_1900 * voicing, data = ay_means))
Call:
lm(formula = F1_n ~ decade_1900 * voicing, data = ay_means)
Residuals:
Min 1Q Median 3Q Max
-0.93874 -0.11050 -0.00824 0.10554 0.82237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.326564 0.022219 59.703 < 2e-16 ***
decade_1900 -0.021277 0.004443 -4.789 2.08e-06 ***
voicingvoiceless -0.161329 0.031423 -5.134 3.75e-07 ***
decade_1900:voicingvoiceless -0.108433 0.006283 -17.257 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1838 on 648 degrees of freedom
Multiple R-squared: 0.816, Adjusted R-squared: 0.8151
F-statistic: 957.6 on 3 and 648 DF, p-value: < 2.2e-16
summary(lm(F1_n ~ decade_1960 * voicing, data = ay_means))
Call:
lm(formula = F1_n ~ decade_1960 * voicing, data = ay_means)
Residuals:
Min 1Q Median 3Q Max
-0.93874 -0.11050 -0.00824 0.10554 0.82237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.198902 0.012301 97.467 < 2e-16 ***
decade_1960 -0.021277 0.004443 -4.789 2.08e-06 ***
voicingvoiceless -0.811927 0.017396 -46.674 < 2e-16 ***
decade_1960:voicingvoiceless -0.108433 0.006283 -17.257 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1838 on 648 degrees of freedom
Multiple R-squared: 0.816, Adjusted R-squared: 0.8151
F-statistic: 957.6 on 3 and 648 DF, p-value: < 2.2e-16
With the date of birth centered on 1900, we get a relatively weak effect (-0.16). With the date of birth centered on 1960, we get a much stronger effect (-0.81). That’s because there is a strong interaction between the date of birth and voicing context, and there is a bigger difference between these two lines in 1960 than in 1900!

It is, in fact, more than possible for a predictor like this to have an estimated value close to 0, and be non-significant, but for it to actuallty matter a lot to the model.
Lessons
- Always think carefully about the meaning of predictors in a model with interactions.
- Always try to draw the graph out.
- Always remember the parameters are estimates of the difference from the reference levels.
- Non-significant != not important.
Categorical interactions.
We’ll return to the weather data to examine the interpretation of categorical interactions. We’ll be using season and a categorical coding of year.
temps_comp <- temps_comp %>%
mutate(year_factor = factor(year))
Fitting the model is straight forward.
temps_model <- lm(temp ~ season * year_factor, data = temps_comp)
summary(temps_model)
Call:
lm(formula = temp ~ season * year_factor, data = temps_comp)
Residuals:
Min 1Q Median 3Q Max
-10.336 -1.744 0.041 1.726 11.898
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2151 0.3092 3.929 9.37e-05 ***
seasonSpring 6.1661 0.4603 13.395 < 2e-16 ***
seasonSummer 12.7472 0.4385 29.067 < 2e-16 ***
seasonAutumn 7.3150 0.4451 16.436 < 2e-16 ***
year_factor2011 3.0264 0.4373 6.920 1.03e-11 ***
seasonSpring:year_factor2011 -1.8707 0.6333 -2.954 0.00324 **
seasonSummer:year_factor2011 -4.0215 0.6176 -6.511 1.43e-10 ***
seasonAutumn:year_factor2011 -0.7270 0.6231 -1.167 0.24375
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.934 on 694 degrees of freedom
Multiple R-squared: 0.6579, Adjusted R-squared: 0.6545
F-statistic: 190.7 on 7 and 694 DF, p-value: < 2.2e-16
There’s a lot going on here, and it is very easy to misinterpret the results if you don’t try to draw out the graph. We’ll take the coefficients in batches. First, there is the “intercept.”
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2151 0.3092 3.929 9.37e-05 ***
This is the estimated average temperature for the references levels, that is, for season = Winter, year = 2010.

Next, there are the season effects.
Estimate Std. Error t value Pr(>|t|)
seasonSpring 6.1661 0.4603 13.395 < 2e-16 ***
seasonSummer 12.7472 0.4385 29.067 < 2e-16 ***
seasonAutumn 7.3150 0.4451 16.436 < 2e-16 ***
Recall, these are the estimated differences between Winter and the named season. But because there is an interaction effect between season and year, these values are even more specific: these are the estimated differences between each of these named seasons and Winter in 2010, because 2010 is the reference level for year.

Next is the year effect
Estimate Std. Error t value Pr(>|t|)
year_factor2011 3.0264 0.4373 6.920 1.03e-11 ***
Again, because there is an interaction effect between season and year, this is estimated difference between Winter 2010 and Winter 2011. I’m drawing this on the graph for winter in a solid line, and for all of the other seasons in a dashed line. As will be made clear in a moment, this is is because the interaction effects can only be interpreted in relation to these dashed lines.

Finally, there are the interation terms!
Estimate Std. Error t value Pr(>|t|)
seasonSpring:year_factor2011 -1.8707 0.6333 -2.954 0.00324 **
seasonSummer:year_factor2011 -4.0215 0.6176 -6.511 1.43e-10 ***
seasonAutumn:year_factor2011 -0.7270 0.6231 -1.167 0.24375
The effect of year in the model was about 3, meaning Winter 2011 was 3 degrees warmer than Winter 2010. If we look at the Spring:Year interaction here, it has a value of -1.9, meaning Spring 2011 was 3-1.9 = 1.1 degree warmer than Spring 2010.
That is, these interaction terms are telling us how different the year effect for each season is from the year effect for Winter, our reference level. Just because Spring:2011 has a significant negative value doesn’t mean Spring 2011 was colder than Spring 2010! It just means that the difference between Spring 2010 and Spring 2011 was not as big as that between Winter 2010 and Winter 2011!
The significant Spring:2011 effect also does not mean that Spring 2011 was a significantly different temperature than Spring 2010!

Next Week
- Introducing Mixed Effects Models
---
title: "Intro & Reviewing Linear Models"
output: 
  html_notebook: 
    code_folding: none
    css: ../custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
---

# Agenda

1. Introductions
2. Course Admin
3. Setup
4. Reviewing the most important parts of Linear Models


# Introductions

<div class = "box break">
<span class = "big-label">Hello!</span>

My name is Josef Fruehwald, and I'm a lecturer in Linguistics and English Language.

</div>

In this first block of multivariate stats, we'll be covering

- The reasons why you should fit Linear Mixed Effects models.
- How to fit LMMs.
- How to interpret the results of LMMs.
- How to do model comparison & parameter evaluation.
- How to report the results of your LMMs.
- Some pragmatic issues in thinking about your results.

In the lab sessions, you'll work through exercises related to fitting LMMs, as well as related R programming skills related to data organization and visualization.

# Course Admin

All lectures will include some R programming you can do during the lecture. I'll recommend that you create a fresh R Notebook for each lecture.

<div class = "box break">
<span class = "big-label">~5 Minute Setup</span>

First create a new R Notebook for this lecture with `File > New File > R Notebook`. I would recommend changing the top header metadata so it has a meaningful title:

```
---
title: "Multivariate - Lecture 1"
output: html_notebook
---
```

Delete everything else in the notebook.

<hr>

Now, let's install some packages that will be necessary in these lectures:


```{r}
install.packages(
  c("tidyverse",
    "devtools",
    "lme4")
)

library("devtools")

install_github("jofrhwld/multivariate2017")

library(tidyverse)
library(lme4)
library(multivariate2017)
```



</div>


<div class = "box hygiene">
<span class = "label">Workspace Hygiene</span>

## Recommended Course Directory Structure

If you have a directory planning structure that you're happy with, go ahead and do that. But I'd recommend the following directory structure & naming conventions.

    ├── multivariate_2017
    │   └── lmm
    │   │   └── assessment
    │   │   └── data
    │   │   └── labs

For a good philosophy on naming files, I'd recommend [these slides from Jenny Bryan's talk on Naming Things](http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf)

Also good practice is to load all the packages you're going to need at the *top* of your script/notebook in one block, rather than haphazardly throughout.

</div>


# The Plan For This Week

## Lecture

Today, we're going to going to review the estimation & interpretation of linear models, with a focus on:

- the scaling & transformation of predictors
- the selection of contrasts
- the interpreation of interaction effects

## Labs

The lab exercises this week will focus on data (re)organization and vizualization.


# Understanding Linear Regression



I find the best way to understand linear regressions is as recipes for drawing a graph. Let's start off with an example relevant to my own life: the relationship between temperatures in Celcius and temperatures in Farenheit. There are two important numeric constants to convert back and forth between them:

1. Freezing = 0&#176;C = 32&#176;F
2. A change of 1&#176;C = 1.8&#176;F


This gives us all of the necessary components for drawing a graph of the relationship between Farenheit and Celcius. First, we know that a line plotting this relationship must cross 0 on the x-axis at 32 on the y-axis, because freezing = 0&#176;C = 32&#176;F.


```{r echo = F}
library(grid)
ggplot()+
  xlim(-3, 40)+
  ylim(0, 40)+
  geom_vline(xintercept = 0, linetype  = 2)+
  geom_hline(yintercept = 0, linetype  = 2)+
  annotate(geom = "segment", x = 0, y = 0, xend = 0, yend = 32,
           size = 1, arrow = arrow())+
  annotate(geom = "point", x = 0, y = 32, color = 'red')+
  xlab("Celcius")+
  ylab("Farenheit")+
  theme_minimal()+
  coord_fixed()
```

Next, we know that for 1 every degree we move to the right, we must move up 1.8.

```{r echo = F}

fc <- data_frame(celcius = c(0, 1, 2, 3, 4),
                 Farenheit = (celcius * 1.8) + 32)

ggplot(fc)+
  xlim(-3, 40)+
  ylim(0, 40)+
  geom_vline(xintercept = 0, linetype  = 2)+
  geom_hline(yintercept = 0, linetype  = 2)+
  annotate(geom = "segment", x = 0, y = 0, xend = 0, yend = 32,
           size = 1, arrow = arrow())+
  geom_step(aes(x  = celcius, y = Farenheit))+  
  geom_point(aes(x  = celcius, y = Farenheit), 
             color = 'red')+
  xlab("Celcius")+
  ylab("Farenheit")+
  theme_minimal()+
  coord_fixed()
```

The line, representing this relationship, thus has the formula:

$$
y = 32 + (1.8\times x)
$$

And looks like

```{r echo = F}

ggplot(fc)+
  xlim(-3, 60)+
  ylim(0, 60)+
  geom_vline(xintercept = 0, linetype  = 2)+
  geom_hline(yintercept = 0, linetype  = 2)+
  annotate(geom = "segment", x = 0, y = 0, xend = 0, yend = 32,
           size = 1, arrow = arrow())+
  geom_step(aes(x  = celcius, y = Farenheit))+  
  geom_point(aes(x  = celcius, y = Farenheit), 
             color = 'red')+
  geom_abline(intercept = 32, slope = 1.8, color = 'red')+
  xlab("Celcius")+
  ylab("Farenheit")+
  theme_minimal()+
  coord_fixed()
```

This line has the following properties:

1. Its **Intercept**, or where the line crosses 0 on the x-axis, is 32.
2. Its **Slope**, is 1.8 (that is, a change of 1 on the x-axis corresponds to a change in 1.8 on the y-axis)


And, in fact, *all* lines, including the ones we fit using linear regression, are described in terms of their intercept (where the line crosses 0 on the x-axis) and slope.


But, let's say you didn't know the formula to convert back and forth between Celcius and Farenheit. Maybe you weren't even sure that there was such a formula. And all you had were a bunch of kind of crappy thermometers that gave you both Celcius and Farenheit readings. And maybe the thermometers are kind of hard to read, or the lighting conditions are dark, or you only have limited experience in reading thermometers. In that case, the relationship between Celcius and Farenheit might look a lot more like this:

```{r, echo = F}
true_c <- c(0, 10, 16, 20, 23, 30)
true_f <- (true_c*1.8) + 32
c_observe <- map(true_c, ~rnorm(20, mean = .x, sd = 5)) %>% simplify()
f_observe <- map(true_f, ~rnorm(20, mean = .x, sd = 5*1.8)) %>% simplify()

temp_df <- data.frame(c_observe = c_observe,
                      f_observe = f_observe)


temp_df %>%
  ggplot(aes(c_observe, f_observe))+
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    #geom_abline(intercept = 32, slope = 1.8, color = "grey40")+
    geom_point()+
    #stat_smooth(method = lm)+
    #xlim(-10, 40)+
    xlab("Celcius")+
    #ylim(0, 100)+
    ylab("Farenheit")+
    coord_fixed()+
    theme_minimal()

```


In this case, you might want to fit a **linear regression**. A linear regression assumes that the relationship between two variables can be captured using *some* line, and it tries to estimate the *Intercept* and *Slope* of this assumed line.

The way it does this is by trying to find parameters for a line that minimize the "error" or the "residuals" between the observed data, and the line. For example, here are examples of *bad* lines. The first assumes that all temperatures Celcius are basically 50&#176;F. The other assumes that all temperatures in Farenheit are basically 3$\times$C.

<div style = "float:left;width = 45%;">
```{r echo = F}
ggplot(temp_df, aes(c_observe, f_observe))+
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_segment(aes(xend = c_observe, yend = 50))+
    geom_point(color = "grey40")+
    geom_hline(yintercept = 50, color = 'red')+
    #stat_smooth(method = 'lm', color = 'red', se = F)+
    xlab("Celcius")+
    ylab("Farenheit")+
    theme_minimal()+
    ggtitle("F = 50")

```
</div>
<div style = "float:left;width = 45%;">
```{r echo = F, width = 3}
ggplot(temp_df, aes(c_observe, f_observe))+
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_segment(aes(xend = c_observe, yend = c_observe * 3))+
    geom_point(color = "grey40")+
    geom_abline(intercept = 0, slope = 3, color = "red")+
    #stat_smooth(method = 'lm', color = 'red', se = F)+
    xlab("Celcius")+
    ylab("Farenheit")+
    theme_minimal()+
    ggtitle("F = 3*C")

```
</div>

Here is a plot of the best-fitting line, that has the smallest average residuals.
```{r echo = F}
temp_mod <- lm(f_observe~c_observe, data = temp_df)
temp_df$fitted <- fitted(temp_mod)

ggplot(temp_df, aes(c_observe, f_observe))+
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_segment(aes(xend = c_observe, yend = fitted))+
    geom_point(color = "grey40")+
    stat_smooth(method = 'lm', color = 'red', se = F, fullrange = T)+
    xlab("Celcius")+
    ylab("Farenheit")+
    theme_minimal()

    
```

There are actually many different ways to figure out what the best line is (remembering that by "line" we really mean the pairing of Intercept and Slope). 
For example, one really simplistic approach^[Actually, this "simplistic approach" is more like the sophisticated approach these days, because there are complicated maths behind making good guesses which are given semantically opaque names like "Simulated Annealing" and "Hamiltonian Monte Carlo".] would be just to draw a bunch of different lines, and see which one worked best.



```{r echo = F}
temp_coef <- coef(temp_mod)
temp_Sigma <- vcov(temp_mod)

mod_df <- data.frame(c_observe = -14:41)
mod_matrix <- model.matrix(~c_observe, data = mod_df)


Beta <- MASS::mvrnorm(6, temp_coef, temp_Sigma * 50)
as_tibble(t(Beta %*% t(mod_matrix))) %>%
  bind_cols(mod_df, .)%>%
  gather(key, value, -c_observe)%>%
  ggplot(aes(c_observe, value))+
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_point(data = temp_df, 
               aes(y = f_observe), 
               group = NA,
               color = "grey80")+
    geom_line(aes(group = key), color = "grey40")+
    stat_smooth(data = temp_df, 
                aes(c_observe, f_observe), 
                color = 'red',
                method = lm, 
                se = F,
                fullrange = T)+
    xlab("Celcius")+
    ylab("Farenheit")+  
    theme_minimal()

```

However, there have been some long established methods for *solving* this problem, at is, for finding the one and truly best Intercept and Slope for a given set of data points.

<div class = "box idiom">
<span class = 'label'>Idiom</span>

This isn't R idiom, but stats idiom. Here are some translations for how people describe solving this problem.

- *analytic*, *closed form* = "We solved it with complicated math."
- *iterative*, *stochastic* = "We made a bunch of guesses."

</div>


For the sake of finishing the illustration, here is the solution that a linear regression found for this sample data:

```{r echo = F}
temp_mod
```

The model has estimated that the Intercept is $\approx$ `r round(coef(temp_mod)[1])`  and the slope is $\approx$ `r round(coef(temp_mod)[2], digits = 1)` . That is to say, based on our "sample", it thinks 0&#176;C = `r round(coef(temp_mod)[1])`&#176;F, and that every 1&#176;C increase corresponds to a `r round(coef(temp_mod)[2], digits = 1)`&#176;F increase.

We are lucky to know that this is, strictly speaking, wrong, but we won't usually, or typically, be able to in such a situation. However, the estimate of both the intercept and slope have the right *sign* (that is, it hasn't guessed the intercept is -10, or the slope goes the other way), and they're just about the right *magnitude* (that is, it hasn't guessed that the slope is something like 10, or 100).

Now, let's consider how the linear model is describing the value of this highlighted dot.
```{r echo = F}
one_dot <- temp_df %>% 
                sample_n(1)


ggplot(temp_df, aes(c_observe, f_observe)) +
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_point(color = "grey40")+
    stat_smooth(method = 'lm', se = F, color = 'black', size = 0.5)+
    geom_point(data = one_dot, color = 'red', size = 3)+
    xlab("Celcius")+
    ylab("Farentheit")+  
    theme_minimal()
```

First, you take it's value in Celcius, and <span style = "color:#216D65;font-weight:bold;">multiply it by `r coef(temp_mod)[2]`</span> (which is `r round(one_dot$c_observe[1] * coef(temp_mod)[2], digits = 2)`)

```{r echo = F}
one_dot$mult <- one_dot$c_observe * coef(temp_mod)[2]
ggplot(temp_df, aes(c_observe, f_observe)) +
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_point(color = "grey40")+
    stat_smooth(method = 'lm', se = F, color = 'black', size = 0.5)+
    geom_point(data = one_dot, color = 'red', size = 3)+
    geom_segment(data = one_dot, 
                 aes(xend = c_observe-0.5,
                     x = c_observe-0.5,
                     y = 0,
                     yend = mult),
                 color = "#216D65",
                 arrow = arrow())+
    xlab("Celcius")+
    ylab("Farentheit")+  
    theme_minimal()
```


Then,<span style = "color:#B08534;font-weight:bold;">you add `r coef(temp_mod)[1]` to it</span> (which is `r round((one_dot$c_observe[1] * coef(temp_mod)[2]) + coef(temp_mod)[1], digits = 2)`)

```{r echo = F}
one_dot$add <- one_dot$mult + coef(temp_mod)[1]

ggplot(temp_df, aes(c_observe, f_observe)) +
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_point(color = "grey40")+
    stat_smooth(method = 'lm', se = F, color = 'black', size = 0.5)+
    geom_point(data = one_dot, color = 'red', size = 3)+
    geom_segment(data = one_dot, 
                 aes(xend = c_observe-0.5,
                     x = c_observe-0.5,
                     y = 0,
                     yend = mult),
                 color = "#216D65",
                 arrow = arrow())+
     geom_segment(data = one_dot, 
                  aes(x = c_observe-0.5,
                      xend = c_observe + 0.5,
                      y = mult,
                      yend = mult),
                  color = "grey20")+
     geom_segment(data = one_dot, 
                 aes(xend = c_observe + 0.5,
                     x = c_observe+0.5,
                     y = mult,
                     yend = add),
                 color = "#B08534",
                 arrow = arrow())+
    xlab("Celcius")+
    ylab("Farentheit")+  
    theme_minimal()
```

Then, you <span style = "color:#AC333C;font-weight:bold;">add the residual error</span> to give us the observed data point.

```{r echo = F}
ggplot(temp_df, aes(c_observe, f_observe)) +
    geom_vline(xintercept = 0, linetype  = 2)+
    geom_hline(yintercept = 0, linetype  = 2)+  
    geom_point(color = "grey40")+
    stat_smooth(method = 'lm', se = F, color = 'black', size = 0.5)+
    geom_point(data = one_dot, color = 'red', size = 3)+
    geom_segment(data = one_dot, 
                 aes(xend = c_observe-0.5,
                     x = c_observe-0.5,
                     y = 0,
                     yend = mult),
                 color = "#216D65",
                 arrow = arrow())+
     geom_segment(data = one_dot, 
                  aes(x = c_observe-0.5,
                      xend = c_observe + 0.5,
                      y = mult,
                      yend = mult),
                  color = "grey20")+
     geom_segment(data = one_dot, 
                 aes(xend = c_observe + 0.5,
                     x = c_observe+0.5,
                     y = mult,
                     yend = add),
                 color = "#B08534",
                 arrow = arrow())+
     geom_segment(data = one_dot, 
                  aes(x = c_observe,
                      xend = c_observe + 0.5,
                      y = add,
                      yend = add),
                  color = "grey20")+ 
  geom_segment(data = one_dot, 
                 aes(xend = c_observe,
                     x = c_observe,
                     y = add,
                     yend = f_observe),
                 color = "#AC333C",
                 arrow = arrow())+  
    xlab("Celcius")+
    ylab("Farentheit")+  
    theme_minimal()
```

The generic formula for finding the value for *any* data point is

$$
F_i = `r round(coef(temp_mod)[1], digits = 1)` + (`r round(coef(temp_mod)[2], digits = 1)`\times C_i) + \epsilon_i.
$$
In this formula the $i$ is convention for referring to some generic, $i^{th}$ data point and $\epsilon$ is conventional for referring to the residual errors.

Some additional conventions:

- Referring to the outcome variable as $y$.
- Referring to the predictors as $x$.
- Referring to the remaining parameters as $\beta$.

So you may sometimes see people referring to $\beta_0$, or the "betas" in a model. The fully generic formula for linear regression is sometimes given as


$$
y_i = \beta_0 + \beta x_i + \epsilon_i
$$


<div class = "box idiom">
<span class = 'label'>Idiom</span>

- We call the *Intercept* and *Slopes* the **parameters** or **coefficients** of the model.
- For every data point, the value you get by only using the slopes and intercepts are called the **fitted values**, and they all lie along the lines you draw.
- Almost every data point's actual value is different from its fitted value, and these differences are called the **residuals**.

</div>

## Understanding categorical predictors

Categorical variables are necessarilly treated differently in regression models.
This is because while you can multiply numeric predictors together, you can't multiply 5 $\times$ "experimental condition".

To illustrate how categorical predictors work in a linear regression, we'll use some example weather data.
I pulled this data of the daily temperature from [the University of Edinburgh Geosciences Department weather station](https://www.ed.ac.uk/geosciences/weather-station/weather-station-data).


```{r echo =F, eval = F}
weathers <- paste("https://www.geos.ed.ac.uk/~weather/jcmb_ws/JCMB_", seq(2007, 2015), ".csv", sep = "")
weather_list <- map(weathers, read.csv)
weather_df <- bind_rows(weather_list)
```
```{r echo = F}
weather_df <- read_csv("../data/edin_weather.csv")
```


```{r echo = F}
library(lubridate)
weather_df %>%
  filter(surface.temperature..C. > -20,
         surface.temperature..C. < 40)%>%
  mutate(date.time = ymd_hm(date.time),
         date = as.Date(date.time)) %>%
  group_by(date)%>%
  summarise(temp = mean(surface.temperature..C.))%>%
  mutate(month = month(date, label = T),
         season = recode(month,
                         Jan = "Winter",
                         Feb = "Winter",
                         Mar = "Spring",
                         Apr = "Spring",
                         May = "Spring",
                         Jun = "Summer",
                         Jul = "Summer",
                         Aug = "Summer",
                         Sep = "Autumn",
                         Oct = "Autumn",
                         Nov = "Autumn",
                         Dec = "Winter"),
         season = factor(season, ordered = F),
         day = yday(date),
         year = year(date)) -> temps
```

```{r echo = F} 
library(magrittr)
library(scales)
temps %>% 
  filter(is.finite(temp)) %>%
  group_by(month, year)%>%
  mutate(extrema = "no",
         extrema = inset(extrema, which.max(temp), "hi"),
         extrema = inset(extrema, which.min(temp), "lo"))->temps

ggplot(temps, aes(date, temp))+
    geom_hline(yintercept = 0, color = muted("blue"))+
    geom_point(aes(color = extrema))+
    scale_color_manual(limits = c("hi", "no","lo"), 
                       values = c("red", "grey40", "blue"),
                       guide = "none")+
    ylab("daily average tempurature")+
    theme_minimal()
```

In order to focus on a categorical effect, I've recoded the date variable into "seasons" like so:

<div style = "width:50%;margin:auto;">

```{r results = 'asis', echo = F}
library(knitr)
temps %>% 
  ungroup()%>%
  count(month, season) %>%
  select(-n)%>%
  kable()
```
</div>

We'll focus on just the subset of 2010 and 2011, because 2010 was a notably cold winter. 
Here's all of the data for the four seasons from 2010 and 2011.
```{r echo = F}
temps %>%
  filter(year %in% c(2010, 2011)) -> temps_comp
temps_comp %>%
ggplot(aes(season, temp))+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_point(position = position_jitter(width = 0.2))+
    theme_minimal()
```

Winter is the "reference level", for the season's factor, which you can see if you use `levels()`.

```{r}
levels(temps_comp$season)
```

When using treatment contrasts, the first level is always treated as the reference level.

If we fit a linear model to this data, we get the following solution:
```{r echo = F}
season_model <- lm(temp ~ season, 
                   data = temps_comp)
summary(season_model)
```

We've got a bunch of parameter estimates under "Estimate" and a bunch of significant p-values.
But what does it mean to say "There is a significant effect of Spring"? 
And can we say anthing about, say, the difference between Spring and Autumn?
Approach results that look like this exactly like you would a regression with continuous predictors: try to draw the graph it is a recipe for.

We start with the "Intercept". 
With continuous predictors, the "intercept" is the y value where the line crosses 0 on the x-axis (a.k.a. the predicted value of the outcome variable when the predictor variable is 0). 
With categorical variables, there are no lines to draw to describe the relationship between outcome and predictor, so calling it an intercept is more of a metaphorical holdover of the continuous case. 

Understand the "Intercept" for categorical predictors to be **the predicted average of the reverence level**.
For the seasons model, the intercept is `r round(coef(season_model)[1], digits = 2)`, meaning the estimated average winter temperature in Edinburgh is `r round(coef(season_model)[1], digits = 2)`


```{r echo = F}
temps_comp %>%
  group_by(season)%>%
  summarise(temp = mean(temp))->temp_comp_means

temp_comp_vec <- temp_comp_means$temp
names(temp_comp_vec) <- temp_comp_means$season
temp_coef <- coef(season_model)
names(temp_coef) <- temp_comp_means$season
temp_range <- range(temps_comp$temp)

```


```{r echo= F}
temp_comp_means%>%
  filter(season == "Winter")%>%
  ggplot(aes(season, temp))+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_point()+
    geom_segment(aes(xend = "Winter", y=0, yend = temp), arrow = arrow())+
    geom_label(aes(x = 1.5, y = 1.5, label = paste0("intercept=",round(temp, digits = 1))))+
    geom_label(aes(x = "Winter",
                   y=temp_comp_vec["Winter"]+1,
                   label = round(temp_comp_vec["Winter"], digits = 1)))+  
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

Next, we move to the the `Spring` effect.
Model estimates a Spring effect of `r round(coef(season_model)[2], digits = 2)`.
This is *not* the estimated temperature of Edinburgh in spring. This is **how different the Spring temperature is from the Winter temperature**. 
Meaning to get the estimated Spring temperature from the model we need to add the Spring effect to the intercept (`r round(coef(season_model)[1], digits = 2)` + `r round(coef(season_model)[2], digits = 2)` = `r round(coef(season_model)[2], digits = 2) +round(coef(season_model)[1], digits = 2) `)


```{r echo = F}
temp_comp_means%>%
  filter(season == "Winter")%>%
  ggplot(aes(season, temp))+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_point()+
    geom_point(data = temp_comp_means %>% filter(season == "Spring"))+
    geom_step(data= temp_comp_means %>% filter(season %in% c("Winter", "Spring")), 
              aes(group = 1))+
    geom_segment(aes(xend = "Winter", y=0, yend = temp), arrow = arrow())+
    geom_segment(aes(xend = "Spring", 
                     x = "Spring",
                     y = temp_comp_vec["Winter"],
                     yend = temp_comp_vec["Spring"]),
                 arrow = arrow())+
    geom_label(aes(x = "Winter",
                   y=temp_comp_vec["Winter"]+1,
                   label = round(temp_comp_vec["Winter"], digits = 1)))+  
    geom_label(aes(x = "Spring",
                   y=temp_comp_vec["Spring"]+1,
                   label = round(temp_comp_vec["Spring"], digits = 1)))+  
    geom_label(aes(x = 1.4, y = 1.5, label = paste0("intercept=",round(temp, digits = 1))))+
    geom_label(aes(x = 2.4, y = 5, label = paste0("spring=+",round(temp_coef["Spring"], digits = 1))))+
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

Next up is the Summer effect, which the model estimates is  `r round(coef(season_model)[3], digits = 2)`. 
Again, this is *not* the estimated temperature in the summer. 
This is *how different Summer is estimated to be from the Intercept (Winter)*. 
To get the estimated Summer temperature from the model we need to add the Summer effect to the intercept (`r round(coef(season_model)[1], digits = 2)` + `r round(coef(season_model)[3], digits = 2)` = `r round(coef(season_model)[3], digits = 2) +round(coef(season_model)[1], digits = 2) `).




```{r echo = F}
temp_comp_means%>%
  filter(season %in% c("Winter", "Spring", "Summer"))%>%
  ggplot(aes(season, temp))+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_point()+
    geom_step(data= temp_comp_means %>% filter(season %in% c("Winter", "Summer")), 
              aes(group = 1))+
    geom_segment(aes(xend = "Winter", x = "Winter", y = 0, yend = temp_comp_vec["Winter"]), 
                 arrow = arrow())+
    geom_segment(aes(xend = "Summer", x = "Summer", y = temp_comp_vec["Winter"], yend = temp_comp_vec["Summer"]), 
                 arrow = arrow())+  
    geom_label(aes(x = season,
                   y=temp + 1,
                   label = round(temp, digits = 1)))+
    geom_label(aes(x = 1.4, y = 1.5, label = paste0("intercept=",round(temp_comp_vec["Winter"], digits = 1))))+
    geom_label(aes(x = 3.4, y = 10, label = paste0("summer=+",round(temp_coef["Summer"], digits = 1))))+
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

And, what should seem familiar by now, the effect of Autumn is `r round(coef(season_model)[4], digits = 2)`, meaning Autumn is `r round(coef(season_model)[4], digits = 2)` degrees warmer than Winter, meaning the estimated average temperature in Autumn is (`r round(coef(season_model)[1], digits = 2)` + `r round(coef(season_model)[4], digits = 2)` = `r round(coef(season_model)[4], digits = 2) +round(coef(season_model)[1], digits = 2) `).




```{r echo = F}
temp_comp_means%>%
  filter(season %in% c("Winter", "Spring", "Summer", "Autumn"))%>%
  ggplot(aes(season, temp))+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_point()+
    geom_step(data= temp_comp_means %>% filter(season %in% c("Winter", "Autumn")), 
              aes(group = 1))+
    geom_segment(aes(xend = "Winter", x = "Winter", y = 0, yend = temp_comp_vec["Winter"]), 
                 arrow = arrow())+
    geom_segment(aes(xend = "Autumn", x = "Autumn", y = temp_comp_vec["Winter"], yend = temp_comp_vec["Autumn"]), 
                 arrow = arrow())+  
    geom_label(aes(x = season,
                   y=temp + 1,
                   label = round(temp, digits = 1)))+
    geom_label(aes(x = 1.4, y = 1.5, label = paste0("intercept=",round(temp_comp_vec["Winter"], digits = 1))))+
    geom_label(aes(x = 3.6, y = 5, label = paste0("autumn=+",round(temp_coef["Autumn"], digits = 1))))+
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

Let's review the summary of the model again:


```{r echo = T}
summary(season_model)
```

We can say the following things about this model:

- The significant `(Intercept)` effect means the average temperature in Winter is significantly different from 0 (estimated to be 2.7).
- The significant `seasonSpring` effect means the difference between Spring and Winter is significantly different from 0 (Spring is estimated to be 5.3 degreees warmer than Winter.)
- The significant `seasonSummer` effect means the difference between Summer and Winter is significantly different from 0 (Summer is estimated to be 10.7 degreees warmer than Winter.)
- The significant `seasonAutumn` effect means the difference between Autumn and Winter is significantly different from 0 (Autumn is estimated to be 7 degreees warmer than Winter.)

And that is pretty much *all* we can say. Looking at these results, we can say that Spring is significantly warmer than Winter, but we *can't* say anything about a generic "effect of Spring."
In fact, *what would we mean by such a thing anyway?*

We also can't say anything about the difference between any of the other seasons.
For example, Spring and Autumn have very similar looking temperatures (their difference from Winter is very similar). We can't say anything about whether or not Spring is warmer or cooler than Autumn. 

# A more practical example - Scaling and transorming predictors

We're now going to walk through a real example from my own research invovling the vowel /ay/. In this example, we're going to be walking through 

<div class = "box break">
<span class = "big-label">Brief Background on /ay/</span>

The English vowel /ay/ is the vowel sound that appears in the English words *ride* and *price*. It is a "diphthong", meaning it is a sound with two different articulatory targets. In the dialect this data is drawn from, the first target is something like a low and back [ɑ], which appears in the words *hot* and *lot*, and the second target is something like a high and front [i:], which shows up in the words *fleece* and *heat*. We'll be focusing on just the first part of the diphthong, the typically low and back part.

Vowel sounds can be characterized using both articulatory and acoustic measures, and it has been found that certain acoustic features can be reliably correlated with the articulatory position of the tongue while the vowels are produced. This means we can characterize speakers' pronunciations of vowel sounds from just a recording of their speech. If we examine how these acoustic properties of vowels covary with speakers' ages or dates of birth, we can also estimate how pronunciations of these vowels have changed over time. 

These acoustic properties are called "formants", and the formant we are going focus on here is "F1", which covaries with the height of the tongue in the mouth. The higher F1, the *lower* in the the mouth the tongue is. It's also the case that F1 values systematically covary with the length of speakers' vocal tracts, so we are actually going to be looking at *normalized F1*, which attempts to factor out this effect.
</div>

For this illustration, we're going to first focus on the vowel /ay/ before voiceless consonants (i.e. /p, t, k, f, s/).
In Philadelphia, where this data is from, /ay/ underwent a fairly large shift in its pronunciation across the 20th century.

```{r}
ayT_means <- ay %>%
  filter(plt_vclass == "ay0" )%>%  # get the pre-voiceless tokens
  mutate(dob = year - age) %>%     # calculate speakers' dob
  group_by(idstring, dob) %>%      # prepare to estimate by-speaker means
  summarise(F1_n = mean(F1_n))     # means
```

```{r}
ayT_means %>%
  ggplot(aes(dob, F1_n))+
    geom_point()+
    theme_bw()
```

In this plot, each point represents one speaker.
Each speaker's average normalized F1 value is plotted against their date of birth (ranging from 1889 to 1998).
Fitting a linear model to this data appears easy, but actually poses us with interpretive challenges. 

```{r}
ayT_model1 <- lm(F1_n ~ dob, data = ayT_means)
```

```{r}
summary(ayT_model1)
```

Let's sort out what is being reported here in this model, starting with the intercept.

```
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.8100970  1.0232146   25.23   <2e-16 ***
```

We see here that the intercept is significantly different from 0, but what does this intercept mean? 
The intercept is the estimated value of the outcome variable when the predictor is 0, meaning the model is predicting a value of normalized F1 of about 26 for speakers born in the year 0.


```{r echo = F} 
ayT_means %>%
  ggplot(aes(dob, F1_n))+
    geom_point()+
    geom_segment(aes(y = 0, yend = 25.8, x=0, xend=0),
                 arrow = arrow())+
    theme_bw()
```

This is obviously problematic. 
First, this is a "significant" effect, but it extrapolates so far beyond the data that it's entirely meaningless. (Nobody was speaking "English" in the year 0.)
Second, this estimated normalized F1 value is not a plausible F1 value for *any* vowel. 
If you were to try to map this onto an articulatory strategy, speaker's tongue bodies would be down somewhere around their ankles.

Let's also look at the slope. 

```
              Estimate Std. Error t value Pr(>|t|)    
dob         -0.0129710  0.0005262  -24.65   <2e-16 ***
```

This means for every 1 unit increase in `dob` (i.e. 1 year), there is a decrease of -0.013 in normalized F1.


```{r echo = F} 
ayT_means %>%
  ggplot(aes(dob, F1_n))+
    geom_point()+
    geom_segment(aes(y = 0, yend = 25.8, x=0, xend=0),
                 arrow = arrow())+
    geom_abline(intercept = 25.8, slope = -0.013, color = "red")+
    theme_bw()
```



If we ignore the comically weird intercept, this slope is less problematic. The one major issue in it is its overspecificity.
This slope is defined in terms of a change in 1 year of date of birth, but in an important sense that's over precise. 
We wouldn't *really* want to make any strong inferences about the difference between speakers' born in 1969 and 1970, for example.

## Rescaling

We can address these problems with the modekl by *rescaling* the date of birth predictor.
Rescaling predictors is not only "allowed," it can be *crucial* when it comes to trying to fit mixed effects models!

<div class = "box idiom">
<span class = "label">Idiom</span>

Not to confuse things too much, but transormations you make to predictors can be lumped into two broad categories: *linear* and *non-linear*. 
This use of "linear" isn't related to linear regression in any way.
Instead, it describes the relationship between the original measure and the transformed measure.

#### Linear transformations are:

- addition, subtraction, multiplication, division

Proof by simulation:

```{r echo = T}
linear_df <- data_frame(original = rnorm(100, mean = 50),
                        addition = original + 10,
                        subtraction = original - 10,
                        multiplication = original * 10,
                        division = original/10)
```

<div style = "width:100%;float:left;">
<div style = "width:23%; float:left; margin:auto;">
```{r echo = F, fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(linear_df, aes(original, addition))+
    geom_point()+
    theme_minimal()+
    ggtitle("addition")
```
</div>
<div style = "width:23%;float:left;margin-left:1%;">
```{r echo = F,fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(linear_df, aes(original,subtraction))+
    geom_point()+
    theme_minimal()+
    ggtitle("subtraction")
```
</div>
<div style = "width:23%;float:left;margin-left:1%;">
```{r echo = F,fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(linear_df, aes(original,multiplication))+
    geom_point()+
    theme_minimal()+
    ggtitle("multiplication")
```
</div>
<div style = "width:23%;float:left;margin-left:1%;">
```{r echo = F,fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(linear_df, aes(original,division))+
    geom_point()+
    theme_minimal()+
    ggtitle("division")
```
</div>
</div>

Any combination of (only) linear transforms is a linear transform.

#### Non-linear transformations are:

- logarithms (any base), cosine, square root, etc.

Proof by simulation:
```{r echo = T}
nonlinear_df <- data_frame(original = rnorm(100, mean = 5),
                           log_ten = log10(original),
                           cosine = cos(original),
                           square_root = sqrt(original),
                           exponential = exp(original))
```

<div style = "width:100%;float:left;">
<div style = "width:23%; float:left; margin:auto;">
```{r echo = F, fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(nonlinear_df, aes(original, log_ten))+
    geom_point()+
    theme_minimal()+
    ggtitle("log10")
```
</div>
<div style = "width:23%;float:left;margin-left:1%;">
```{r echo = F,fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(nonlinear_df, aes(original,cosine))+
    geom_point()+
    theme_minimal()+
    ggtitle("cosine")
```
</div>
<div style = "width:23%;float:left;margin-left:1%;">
```{r echo = F,fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(nonlinear_df, aes(original,square_root))+
    geom_point()+
    theme_minimal()+
    ggtitle("square root")
```
</div>
<div style = "width:23%;float:left;margin-left:1%;">
```{r echo = F,fig.width = 1, fig.height = 1, out.width="100%"}
ggplot(nonlinear_df, aes(original,exponential))+
    geom_point()+
    theme_minimal()+
    ggtitle("exponential")
```
</div>
</div>

</div>

If we apply one, or many, *linear* transformations to the predictor variable, we won't affect the goodness of fit of the model, nor the significance or sign of that predictor.
But it will change the way we *interpret* the effects.

There are a few different kinds of transformations we can do. 
A very common kind of transformation is to **z-score** the data.
To z-score numeric data, you *subtract the mean* and *divide by the standard deviation*.
Here's how that looks in mathematical notation

$$
z = \frac{x-\mu}{\sigma}
$$

Where $\mu$ is the mean of the $x$ data, and $\sigma$ is the standard deviation.
Here's how that looks in practice. 
The date of birth of speakers has the following distribution of data, with the meand and $\pm1$ standard deviation annotated.



```{r echo = F}
mean_dob <- mean(ayT_means$dob)
sd_dob <- sd(ayT_means$dob)
ggplot(ayT_means, aes(dob))+
    geom_density(color = "black", fill = "grey50")+
    geom_vline(xintercept = mean_dob)+
    geom_label(aes(x = mean_dob, y = 0.006, label = "mean"))+
    geom_vline(xintercept = mean_dob + sd_dob, linetype = 2)+
    geom_label(aes(x = mean_dob + sd_dob, y = 0.006, label = "+1 sd"))+
    geom_vline(xintercept = mean_dob - sd_dob, linetype = 2)+
    geom_label(aes(x = mean_dob - sd_dob, y = 0.006, label = "-1 sd"))+    
    theme_minimal()+
    ggtitle("Original Scale")
```


By subtracting out the mean, we center this distribution on 0, and by dividing by the standard deviation, we shrink down the distribution so that 1 standard deviation from the mean is the value 1.
Here's how the result looks:



```{r echo = F}
mean_dob <- mean(ayT_means$dob)
sd_dob <- sd(ayT_means$dob)

ayT_means <- ayT_means%>%
                ungroup()%>%
                mutate(dob_z = (dob-mean(dob))/sd(dob))
ggplot(ayT_means, aes(dob_z))+
    geom_density(color = "black", fill = "grey50")+
    geom_vline(xintercept = 0)+
    geom_label(aes(x = 0, y = 0.15, label = "mean"))+
    geom_vline(xintercept = 1, linetype = 2)+
    geom_label(aes(x = 1, y = 0.15, label = "+1 sd"))+
    geom_vline(xintercept = -1, linetype = 2)+
    geom_label(aes(x = -1, y = 0.15, label = "-1 sd"))+    
    theme_minimal()+
    ggtitle("z-Scale")
```


### Why is this an improvement? 

Let's re-fit the model with this transformed DOB predictor and see:

```{r}
ayT_model2 <- lm(F1_n ~ dob_z, data = ayT_means)
summary(ayT_model2)
```

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.58862    0.01205   48.83   <2e-16 ***
```

The intercept is the estimated value of normalized F1 when `dob_z` is 0, which is the mean of dob, `r round(mean_dob)`.

```{r echo = F} 
ayT_means %>%
  ggplot(aes(dob, F1_n))+
    geom_point(color = "grey70")+
    geom_segment(aes(y = 0, yend = coef(ayT_model2)[1], x=mean_dob, xend=mean_dob),
                 arrow = arrow())+
    theme_bw()
```


```
            Estimate Std. Error t value Pr(>|t|)
dob_z       -0.29760    0.01207  -24.65   <2e-16 ***
```

The slope tells us how much normalized F1 changes for every change of 1 standard deviation of dob (`r round(sd_dob)` years)


For every 1 standard deviation shift in dob (`r round(sd_dob)` years), there is a decrease of 0.3 in normalized F1.

```{r echo = F}
z_step_df <- data_frame(dob = c(mean_dob - sd_dob, mean_dob, mean_dob + sd_dob),
                        F1_n = c(coef(ayT_model2)[1] - coef(ayT_model2)[2],
                                 coef(ayT_model2)[1],
                                  sum(coef(ayT_model2))))


ayT_means %>%
  ggplot(aes(dob, F1_n))+
    geom_point(color = "grey70")+
    geom_segment(aes(y = 0, yend = coef(ayT_model2)[1], x=mean_dob, xend=mean_dob),
                 arrow = arrow())+
    geom_step(data = z_step_df, direction = 'vh')+
    geom_point(data = z_step_df, color = "red")+
    theme_bw()
```

These values of the intercept and slope are more *interpretable*. 

### Other possible transformations

The z-score transformation isn't the *only* transformation you can use. It's just the most standard one. 
For example, Gelman & Hill (200XX) recommend using a modified z-score where you divide by $2\times\sigma$

$$
z = \frac{x-\mu}{2\sigma}
$$

The reason these formulas use the mean and standard deviation is because they are two theory neutral values that can be estimated for any set of numeric data.
However, it's possible that there are more sensible scaling factors for the data you're working with.
For example, in many sociolinguistic studies, the year 1960 is an important inflection point for many changes.
And `r round(sd_dob)` years doesn't have any intrinsic meaning, while something like 10 years is more intuitively accessible to most readers ("a decade").
So we could just as easilly choose this scaling formula for date of birth.


$$
decade = \frac{x - 1960}{10}
$$

### Are we really allowed to do all this rescaling?

Yes, but, **you must explain what you did in the write up**. For example:

> The date of birth predictor was centered on 1960, and divided by 10. The 1960 was chosen as the centering point because of its importance in previous sociolinguistic studies. Dividing dob by 10 gives the resulting slope the interpretation of the rate of change per decade.



# Multivariate Models & Interactions

We're quickly going to go over multivariate models (multiple predictors) and interactions.
The important thing to keep in mind when including multiple predictor is that your reference level and scaling of each predictor is going to start to matter a *lot* more.


## Continious and categorical predictors

Let's start with the /ay/ data. Previously we were looking only at pre-voiceless /ay/. But now we're going to look at all /ay/, with each token coded for whether it's pre-voiceless, or elsewhere. This is encoded in the `plt_vclass` column.

- `plt_vclass`
    - `ay0` - Pre-voiceless /ay/
    - `ay` - All other /ay/.
    
    
Because we're not dealing with mixed effects models yet, we'll first calculate by-speaker means. In the process I'll be re-scaling date of birth with two differing centering points, one on 1900, and one on 1960.


```{r}
ay_means <- ay %>%
              mutate(dob = year-age) %>%
              group_by(idstring, dob, plt_vclass) %>%
              summarise(F1_n = mean(F1_n))%>%
              mutate(decade_1900 = (dob - 1900)/10,
                     decade_1960 = (dob - 1960)/10,
                     voicing = factor(recode(plt_vclass, ay0 = "voiceless",
                                                         ay = "else")),
                     voicing = relevel(voicing, "else"))
```

```{r}
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point()+
    theme_bw()
```

## Two predictors, no interactions

We can see that there is a *big* difference between pre-voiceless and elsewhere /ay/. Let's start building up the model then


```{r}
ay_model1 <- lm(F1_n ~ decade_1900 + voicing, data = ay_means)
```

```{r}
summary(ay_model1)
```


Let's walk through the effects by drawing the graph that this model is a recipe for. First, we have the Intercept.

```
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.567577   0.020862   75.14   <2e-16 ***
```

This is the estimated average value of normalized F1 for when *all predictors are 0.* 
That means this is the predicted intercept when the centered dob is 0 (1900), for the *reference level* (elsewhere /ay/)


```{r echo = F}
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_segment(aes(x = 1900, xend = 1900, y = 0, 
                     yend = coef(ay_model1)[1], 
                     color = "else"),
                 arrow = arrow(),
                 show.legend = F)+
    theme_bw()
```

Next there's the slope: 

```
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.567577   0.020862   75.14   <2e-16 ***
```

There's a few different ways we could construe this slope. The value is the rate of change per decade, and we'll interpret this as the slope for reference level.

```{r echo = F}
badmod <- lm(F1_n ~ dob + voicing, data = ay_means)

ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_segment(aes(x = 1900, xend = 1900, y = 0, 
                     yend = coef(ay_model1)[1], 
                     color = "else"),
                 arrow = arrow(),
                 show.legend = F)+
    geom_abline(aes(intercept = coef(badmod)[1],
                slope = coef(badmod)[2],
                color = "else"),
                show.legend = F)+
    theme_bw()
```


Finally, there's the voicing parameter. 

```
                  Estimate Std. Error t value Pr(>|t|)    
voicingvoiceless -0.643357   0.017376  -37.02   <2e-16 ***
```

This tells us how different the *intercept* of the line for pre-voiceless /ay/ is from elsewhere /ay/.
Because we have included no other effects or interactions, the model is assuming that the slope of the pre-voiceless /ay/ is *identical* to elsewhere /ay/.


```{r echo = F}
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_segment(aes(x = 1900, 
                     xend = 1900, 
                     yend = coef(ay_model1)[1] + coef(ay_model1)[3] , 
                     y = coef(ay_model1)[1]),
                 arrow = arrow(),
                 color = "black",
                 show.legend = F)+
    geom_abline(aes(intercept = coef(badmod)[1],
                slope = coef(badmod)[2],
                color = "else"),
                show.legend = F)+
    geom_abline(aes(intercept = coef(badmod)[1] + coef(badmod)[3],
                slope = coef(badmod)[2],
                color = "voiceless"),
                show.legend = F)+  
    theme_bw()
```


Note, looking a this graph, you can see that this voiceless effect is **not** the estimated intercept of pre-voicelss /ay/. It is the estimated *difference* in intercepts between these two lines.

### Evaluating the model.

Let's look at the model summary again:

```{r}
summary(ay_model1)
```

If you only fit this model, and didn't plot any of the data, you might conclude:

- There is a significant change over decade for /ay/.
- There is a significant effect of voicing on /ay/.

## Including interactions

However, looking at the plot, it's clear something more complicated is going on. 
Neither of the lines look like they're passing through their related data points very well.
There is, in fact, an *interaction* between voicing and date of birth. 
Pre-voiceless /ay/ is undergoing a change, while elsewhere /ay/ is basically staying the same.
We can account for this by including this interaction in the model:


```{r}
ay_model2 <- lm(F1_n ~ decade_1900 * voicing, data = ay_means)
```

```{r}
summary(ay_model2)
```

Let's build up the graph.

```
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   1.326564   0.022219  59.703  < 2e-16 ***
```

This is the estimated average for when all preditors are 0 (i.e. dob = 1900, voicing = else).
```{r echo = F}
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_segment(aes(x = 1900, 
                     xend = 1900, 
                     yend = coef(ay_model2)[1] , 
                     y = 0,
                     color = "else"),
                 arrow = arrow(),
                 show.legend = F)+
    theme_bw()
```


```
                              Estimate Std. Error t value Pr(>|t|)    
decade_1900                  -0.021277   0.004443  -4.789 2.08e-06 ***
```

This is the estimated slope *for the reference level* (i.e. voicing = else)

```{r echo = F}
badmod2 <- lm(F1_n ~ dob * voicing, data = ay_means)

ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_segment(aes(x = 1900, 
                     xend = 1900, 
                     yend = coef(ay_model2)[1] , 
                     y = 0,
                     color = "else"),
                 arrow = arrow(),
                 show.legend = F)+
    geom_abline(aes(intercept = coef(badmod2)[1],
                    slope = coef(badmod2)[2],
                    color = "else"),
                show.legend = F)+
    theme_bw()
```

Now, we get into the more complicated coefficients.


```
                              Estimate Std. Error t value Pr(>|t|)    
voicingvoiceless             -0.161329   0.031423  -5.134 3.75e-07 ***
```

This is how different the line for pre-voiceless /ay/ is from elsewhere /ay/ *at the intercept* (i.e. when dob = 1900).

```{r echo = F}
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_abline(aes(intercept = coef(badmod2)[1],
                    slope = coef(badmod2)[2],
                    color = "else"),
                show.legend = F)+
    geom_segment(aes(y = coef(ay_model2)[1],
                     yend = coef(ay_model2)[1] + coef(ay_model2)[3],
                     x = 1900,
                     xend = 1900,
                     color = "voiceless"),
                 arrow = arrow(),
                 show.legend = F)+
    theme_bw()
```

Finally, we have the interaction effect.

```
                              Estimate Std. Error t value Pr(>|t|)    
decade_1900:voicingvoiceless -0.108433   0.006283 -17.257  < 2e-16 ***
```

This is the most easilly misinterpretable parameter in this model. 
It is *not* the estimated slope of pre-voiceless /ay/. It is *how different* pre-voiceless /ay/'s slope is from elsewhere /ay/. 
That is, if you drew a line parallel to elsewhere /ay/, thos is how much you'd have to rotate the the line to get pre-voiceless /ay/'s slope.

```{r echo = F}
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_abline(aes(intercept = coef(badmod2)[1],
                    slope = coef(badmod2)[2],
                    color = "else"),
                show.legend = F)+
    geom_segment(aes(y = coef(ay_model2)[1],
                     yend = coef(ay_model2)[1] + coef(ay_model2)[3],
                     x = 1900,
                     xend = 1900,
                     color = "voiceless"),
                 arrow = arrow(),
                 show.legend = F)+
    geom_abline(aes(intercept = coef(badmod2)[1] + coef(ay_model2)[3],
                    slope = coef(badmod2)[2],
                    color = "voiceless"),
                linetype = 2,
                show.legend = F)+
    geom_abline(aes(intercept = coef(badmod2)[1] + coef(badmod2)[3],
                    slope = coef(badmod2)[2] + coef(badmod2)[4],
                    color = "voiceless"),
                linetype = 1,
                show.legend = F)+  
    theme_bw()
```

### Be careful thinking about models with interactions

This is super important to keep in mind, let's say we ran this model, and got the following coefficients:

```
(Intercept)                   1.32    ***
decade_1900                  -10      ***
voicingvoiceless             -0.16    ***
decade_1900:voicingvoiceless  10      ***
```

The significant interaction between date of birth and voicing here would **not** mean that there is significant change for pre-voiceless /ay/ over time. 
It means there is a **significant difference** between the slope of elsewhere /ay/ and pre-voiceless /ay/. 
The estimated slope of elsewhere /ay/ in this fake model is -10, and for pre-voiceless /ay/ it's -10 + 10 = 0!

The other thing to keep in mind with interpreting the results of the model is that the `voicingvoiceless` effect is *not* an estimate of "the effect of voicing."
It is the estimated difference between the two lines *at the intercept*. 
If we change where the intercept is, for example by using a different scaling, we also change the value of this parameter. 

```{r}
summary(lm(F1_n ~ decade_1900 * voicing, data = ay_means))
```


```{r}
summary(lm(F1_n ~ decade_1960 * voicing, data = ay_means))
```

With the date of birth centered on 1900, we get a relatively weak effect (-0.16). 
With the date of birth centered on 1960, we get a much stronger effect (-0.81).
That's because there is a strong interaction between the date of birth and voicing context, and there is a bigger difference between these two lines in 1960 than in 1900!


```{r echo = F}
ay_model3 <- lm(F1_n ~ decade_1960 * voicing, data = ay_means)
ggplot(ay_means, aes(dob, F1_n, color = voicing))+
    geom_point(alpha = 0.3)+
    geom_abline(aes(intercept = coef(badmod2)[1],
                    slope = coef(badmod2)[2],
                    color = "else"),
                show.legend = F)+
    geom_segment(aes(y = coef(ay_model2)[1],
                     yend = coef(ay_model2)[1] + coef(ay_model2)[3],
                     x = 1900,
                     xend = 1900,
                     color = "voiceless"),
                 arrow = arrow(),
                 show.legend = F)+
    geom_segment(aes(y = coef(ay_model3)[1],
                     yend = coef(ay_model3)[1] + coef(ay_model3)[3],
                     x = 1960,
                     xend = 1960,
                     color = "voiceless"),
                 arrow = arrow(),
                 show.legend = F)+  
    geom_abline(aes(intercept = coef(badmod2)[1] + coef(badmod2)[3],
                    slope = coef(badmod2)[2] + coef(badmod2)[4],
                    color = "voiceless"),
                linetype = 1,
                show.legend = F)+  
    theme_bw()
```

It is, in fact, more than possible for a predictor like this to have an estimated value close to 0, and be non-significant, but for it to actuallty matter a *lot* to the model.

**Lessons**

- Always think carefully about the *meaning* of predictors in a model with interactions.
- Always try to draw the graph out.
- Always remember the parameters are estimates of the *difference* from the reference levels.
- Non-significant != not important.


## Categorical interactions.

We'll return to the weather data to examine the interpretation of categorical interactions. 
We'll be using season and a categorical coding of year.

```{r}
temps_comp <- temps_comp %>%
                  mutate(year_factor = factor(year))
```

Fitting the model is straight forward.
```{r}
temps_model <- lm(temp ~ season * year_factor, data = temps_comp)
```

```{r}
summary(temps_model)
```

There's a lot going on here, and it is very easy to misinterpret the results if you don't try to draw out the graph. 
We'll take the coefficients in batches. First, there is the "intercept."

```
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    1.2151     0.3092   3.929 9.37e-05 ***
```

This is the estimated average temperature for the references levels, that is, for season = Winter, year = 2010.


```{r echo= F}
temp_comp_means%>%
  filter(season == "Winter")%>%
  ggplot(aes(season, temp))+
    geom_blank()+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_segment(aes(x = 0.9,
                     xend = 0.9,
                     y = 0,
                     yend = coef(temps_model)[1]),
                 arrow = arrow())+
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

Next, there are the season effects.
```
                             Estimate Std. Error t value Pr(>|t|)    
seasonSpring                   6.1661     0.4603  13.395  < 2e-16 ***
seasonSummer                  12.7472     0.4385  29.067  < 2e-16 ***
seasonAutumn                   7.3150     0.4451  16.436  < 2e-16 ***
```

Recall, these are the estimated *differences* between Winter and the named season. But because there is an interaction effect between season and year, these values are even more specific: these are the estimated differences between each of these named seasons and Winter **in 2010**, because 2010 is the reference level for year.



```{r echo= F}

spring <- data.frame(season = c(0.9, 1.9),
                     temp = c(coef(temps_model)[1], 
                              coef(temps_model)[1] + coef(temps_model)[2]))
summer <- data.frame(season = c(0.9, 2.9),
                     temp = c(coef(temps_model)[1], 
                              coef(temps_model)[1] + coef(temps_model)[3]))
fall <- data.frame(season = c(0.9, 3.9),
                     temp = c(coef(temps_model)[1], 
                              coef(temps_model)[1] + coef(temps_model)[4]))

temp_comp_means%>%
  filter(season == "Winter")%>%
  ggplot(aes(season, temp))+
    geom_blank()+
    geom_hline(yintercept = 0, color = muted("blue"))+  
    geom_segment(aes(x = 0.9,
                     xend = 0.9,
                     y = 0,
                     yend = coef(temps_model)[1]),
                 arrow = arrow())+
  geom_label(aes(x = 0.9,
                   y = coef(temps_model)[1]+1,
                   label = "2010"),
               size = 2.5)+    
    geom_step(data = spring, group = 1)+
    geom_segment(aes(x = 1.9,
                     xend = 1.9,
                     y = coef(temps_model)[1],
                     yend = coef(temps_model)[1] + coef(temps_model)[2]),
                 arrow = arrow())+
    geom_label(aes(x = 1.9,
                   y = coef(temps_model)[1] + coef(temps_model)[2]+1,
                   label = "2010"),
               size = 2.5)+  
    geom_segment(aes(x = 1.9,
                     xend = 1.9,
                     y = coef(temps_model)[1],
                     yend = coef(temps_model)[1] + coef(temps_model)[2]),
                 arrow = arrow())+ 
    geom_label(aes(x = 2.9,
                   y = coef(temps_model)[1] + coef(temps_model)[3]+1,
                   label = "2010"),
               size = 2.5)+  
    geom_segment(aes(x = 2.9,
                     xend = 2.9,
                     y = coef(temps_model)[1],
                     yend = coef(temps_model)[1] + coef(temps_model)[3]),
                 arrow = arrow())+    
    geom_label(aes(x = 3.9,
                   y = coef(temps_model)[1] + coef(temps_model)[4]+1,
                   label = "2010"),
               size = 2.5)+  
    geom_segment(aes(x = 3.9,
                     xend = 3.9,
                     y = coef(temps_model)[1],
                     yend = coef(temps_model)[1] + coef(temps_model)[4]),
                 arrow = arrow())+   
    geom_step(data = summer, group = 1)+  
    geom_step(data = fall, group = 1)+  
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

Next is the year effect

```
                             Estimate Std. Error t value Pr(>|t|)    
year_factor2011                3.0264     0.4373   6.920 1.03e-11 ***
```

Again, because there is an interaction effect between season and year, this is estimated difference between Winter 2010 and Winter 2011. I'm drawing this on the graph for winter in a solid line, and for all of the other seasons in a dashed line. As will be made clear in a moment, this is is because the interaction effects can only be interpreted in relation to these dashed lines.

```{r echo= F}
winter_year <- data.frame(season = c(0.9, 1.1),
                          temp = c(coef(temps_model)[1], 
                                   coef(temps_model)[1] + coef(temps_model)[5]))
spring_year <- data.frame(season = c(1.9, 2.1),
                          temp = c(coef(temps_model)[1] + coef(temps_model)[2], 
                                   coef(temps_model)[1] + coef(temps_model)[2]
                                   + coef(temps_model)[5]))
summer_year <- data.frame(season = c(2.9, 3.1),
                          temp = c(coef(temps_model)[1] + coef(temps_model)[3], 
                                   coef(temps_model)[1] + coef(temps_model)[3]
                                   + coef(temps_model)[5]))
autumn_year <- data.frame(season = c(3.9, 4.1),
                          temp = c(coef(temps_model)[1] + coef(temps_model)[4], 
                                   coef(temps_model)[1] + coef(temps_model)[4]
                                   + coef(temps_model)[5]))

temp_comp_means%>%
  filter(season == "Winter")%>%
  ggplot(aes(season, temp))+
    geom_blank()+
    geom_hline(yintercept = 0, color = muted("blue"))+  
  geom_step(data = winter_year)+
  geom_step(data = spring_year, linetype = 2)+  
  geom_step(data = summer_year, linetype = 2)+  
  geom_step(data = autumn_year, linetype = 2)+    
  geom_segment(aes(x = 1.1, xend = 1.1,
                   y = coef(temps_model)[1],
                   yend =  coef(temps_model)[1] + coef(temps_model)[5]),
               arrow = arrow())+
  geom_label(aes(x = 1.1,
                   y = coef(temps_model)[1] + coef(temps_model)[5] + 1,
                   label = "2011"),
               size = 2.5)+      
  geom_label(aes(x = 0.9,
                   y = coef(temps_model)[1],
                   label = "2010"),
               size = 2.5)+    
    geom_label(aes(x = 1.9,
                   y = coef(temps_model)[1] + coef(temps_model)[2],
                   label = "2010"),
               size = 2.5)+  
    geom_label(aes(x = 2.9,
                   y = coef(temps_model)[1] + coef(temps_model)[3],
                   label = "2010"),
               size = 2.5)+  
    geom_label(aes(x = 3.9,
                   y = coef(temps_model)[1] + coef(temps_model)[4],
                   label = "2010"),
               size = 2.5)+  
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

Finally, there are the interation terms!


```
                             Estimate Std. Error t value Pr(>|t|)    
seasonSpring:year_factor2011  -1.8707     0.6333  -2.954  0.00324 ** 
seasonSummer:year_factor2011  -4.0215     0.6176  -6.511 1.43e-10 ***
seasonAutumn:year_factor2011  -0.7270     0.6231  -1.167  0.24375    
```

The effect of year in the model was about 3, meaning Winter 2011 was 3 degrees warmer than Winter 2010. If we look at the Spring:Year interaction here, it has a value of -1.9, meaning Spring 2011 was 3-1.9 = `r 3-1.9` degree warmer than Spring 2010.

That is, these interaction terms are telling us how different the year effect for each season is from the year effect for Winter, our reference level. Just because Spring:2011 has a significant negative value doesn't mean Spring 2011 was colder than Spring 2010! It just means that the difference between Spring 2010 and Spring 2011 was not as big as that between Winter 2010 and Winter 2011!

The significant Spring:2011 effect also **does not mean** that Spring 2011 was a significantly different temperature than Spring 2010!

```{r echo= F}
temp_comp_means%>%
  filter(season == "Winter")%>%
  ggplot(aes(season, temp))+
    geom_blank()+
    geom_hline(yintercept = 0, color = muted("blue"))+  
  geom_step(data = winter_year)+
  geom_step(data = spring_year, linetype = 2)+  
  geom_step(data = summer_year, linetype = 2)+  
  geom_step(data = autumn_year, linetype = 2)+    
  geom_segment(aes(x = 1.1, xend = 1.1,
                   y = coef(temps_model)[1],
                   yend =  coef(temps_model)[1] + coef(temps_model)[5]),
               arrow = arrow())+
  geom_label(aes(x = 1.1,
                   y = coef(temps_model)[1] + coef(temps_model)[5] + 1,
                   label = "2011"),
               size = 2.5)+ 

  geom_segment(aes(x = 2.1, xend = 2.1,
                   y = coef(temps_model)[1] + coef(temps_model)[2]+
                                   coef(temps_model)[5],
                   yend =  coef(temps_model)[1] + coef(temps_model)[2]+
                                   coef(temps_model)[5] + coef(temps_model)[6]),
               arrow = arrow())+  
    geom_label(aes(x = 2.1,
                   y = coef(temps_model)[1] + coef(temps_model)[2]+
                                   coef(temps_model)[5] + coef(temps_model)[6]-1,
                   label = "2011"),
               size = 2.5)+      
  
    geom_segment(aes(x = 3.1, xend = 3.1,
                   y = coef(temps_model)[1] + coef(temps_model)[3]+
                                   coef(temps_model)[5],
                   yend =  coef(temps_model)[1] + coef(temps_model)[3]+
                                   coef(temps_model)[5] + coef(temps_model)[7]),
               arrow = arrow())+  
    geom_label(aes(x = 3.1,
                   y = coef(temps_model)[1] + coef(temps_model)[3]+
                                   coef(temps_model)[5] + coef(temps_model)[7]-1,
                   label = "2011"),
               size = 2.5)+    


      geom_segment(aes(x = 4.1, xend = 4.1,
                   y = coef(temps_model)[1] + coef(temps_model)[4]+
                                   coef(temps_model)[5],
                   yend =  coef(temps_model)[1] + coef(temps_model)[4]+
                                   coef(temps_model)[5] + coef(temps_model)[8]),
               arrow = arrow())+  
    geom_label(aes(x = 4.1,
                   y = coef(temps_model)[1] + coef(temps_model)[4]+
                                   coef(temps_model)[5] + coef(temps_model)[8]-1,
                   label = "2011"),
               size = 2.5)+    
    
  
  
    geom_label(aes(x = 0.9,
                   y = coef(temps_model)[1],
                   label = "2010"),
               size = 2.5)+    
    geom_label(aes(x = 1.9,
                   y = coef(temps_model)[1] + coef(temps_model)[2],
                   label = "2010"),
               size = 2.5)+  
    geom_label(aes(x = 2.9,
                   y = coef(temps_model)[1] + coef(temps_model)[3],
                   label = "2010"),
               size = 2.5)+  
    geom_label(aes(x = 3.9,
                   y = coef(temps_model)[1] + coef(temps_model)[4],
                   label = "2010"),
               size = 2.5)+  
    xlim("Winter", "Spring", "Summer", "Autumn")+
    ylim(temp_range[1], temp_range[2])+
    theme_bw()
```

## Next Week

- Introducing Mixed Effects Models
